Abstract

This project consisted of investigating the relationship between a variety of predictive factors and the success of four mice in a neuroscience experiment. Goals included determining the data structure, exploring the data to understand the relationship between all variables, integrating the data into a usable format, and building a predictive model with this integrated data. This model was used to predict whether a experiment trial would be a success or a failure, depending on a variety of factors. Several interesting relationships were discovered during exploratory data analysis, and multiple models were made using multiple statistical modeling methods. Conclusions were made based on a final model, which was chosen with predefined criteria. This model was evaluated on a final set of test data, to determine its accuracy and reliability regarding its predictive performance.

Introduction

The topic of this project comes from data collected during the study “Distributed coding of choice, action, and engagement across the mouse brain”, by Steinmetz et al. (2019). This experiment involves the observation of 10 different mice across 39 different sessions, where each session involves many sequential trials where different stimuli is administered to the mouse. Within a trial, each mouse is given a visual stimuli through the medium of two screens on either side of the mouse, where the screens differed in their ‘contrast levels’. Contrast levels consisted of 0, 0.25, 0.5, and 1, ordered from no stimuli (0) to maximum contrast (1). Given these base conditions, each mouse made a decision (labelled as feedback), which was made by moving a wheel controlled by their forepaws. Rewards and penalties were administered based on the wheel decision, depending on the contrast conditions. A 1 denotes a successful trial (where the correct decision was made), whereas a -1 represents a failure (the incorrect decision was made). When contrast on the left screen had a greater assigned value than the contrast on the right screen, turning the wheel to the right represented a success while any other decision represented a failure. In this same way, when right contrast exceeds left contrast, success was given for turning the wheel to the left, and failure otherwise. If left and right contrast were both 0, moving the wheel at all was considered a failure. On the other hand, if left and right contrast were not 0 but both equal, success would be randomly chosen as either rotating the wheel to the left or the right.

To explore how the brain reacted to the different trials and stimuli conditions, the researchers recorded the activity of neurons within the visual cortex of each mouse, recording data from several different brain areas. This neuron information is represented by ‘spike trains’, which show neurons firing across pre-specified time bins. In this paper, the focus will be on the spike trains of the first 0.4 seconds after stimuli was administered. To simplify the data analysis and modeling process, this investigation will additionally focus on sessions 1 through 18, which have Cori, Frossman, Hence, and Lederberg as our mice subjects.

In designing this project, there were several influencing factors to consider when deciding how to begin. The data is not packaged in a traditional data format (such as CSV), but rather through an assymetric “list of lists”, with varying dimensions between sessions and between variables. Each session has a different number of trials and a different number of neurons (from differing brain areas), making preliminary cleaning and analysis complex. Additionally, each mouse participated in a different number of sessions, adding an additional layer of complexity to the data structure. Besides this, the format for much of the recorded data (such as spike trains) would seem impractical for analysis in their given formats, making data restructuring potentially necessary.

The main goal of this project is to build a predictive model for feedback (or whether a trial was a success or failure), based on the best collection of predictors. Analysis and modeling focuses especially on the contrast levels and neural activity, which would logically be most correlated with a successful or unsuccessful trial. This goal was constructed with the desired result for better understanding of the data and how to predict its outcome, and a better understanding of the underlying statistical techniques to achieve this.

Upon preliminary reflection over the expirement used for this model building, including its design, execution, and subsequently collected data, there are several potential approaches to be taken when beginning to construct a model. With a variety of potential predictor variables and a plethora of statistical methods to employ to achieve the end goal of modeling feedback, exploratory data analysis is necessary to gain a deeper understanding of the underlying data structure, and to begin identifying potential relationships between variables.

Exploratory Analysis

To begin exploring the data recorded within the first 18 sessions, session information was downloaded from the various session RDS files and stored in a list, where each index of the list corresponded with a specific session. There are several variables attached to each session (with information pertaining to each trial). These included feedback_type, contrast_left, contrast_right, time, spks, and brain_area. feedback_type gives information on whether a trial was a success or failure, contrast_left and contrast right give information about the respective stimuli for each screen, time represents the center of the time bins for the spks variable, spks represents the number of neuron spikes, and brain_area represents which part of the brain each recorded neuron resides in.

With a basic understanding of the underlying data structure, preliminary analysis began by considering the number of neurons and number of trials for each session.

As pictured above, each mouse participated in a varying number of trials per session (and each mouse participated in a varying number of sessions). Additionally, Each mouse had a varying number of active neurons per session. This information is valuable as it can show a potential performance depending on session number or mouse, as different mice were subjected to different amounts of trials (under the hypothesis that a mouse might get tired as trials continue and start performing worse).

Before further exploring these potential relationships, it was necessary to further explore spike train data due to the complexity of neural information. With a varying number of neurons per session and trial, it was important to find a way to compare neurons across trials and sessions. From some brief exploratory analysis, it was clear that brain_area added a high degree of complexity to the analysis of neural activity. For this reason, the brain area of each neuron was omitted in further exploratory analysis (and subsequent model building), to focus on better understanding the relationship between other predictors. This decision was also made because a relationship between brain area and neural activity did not seem straightforward. With brain area addressed, One avenue to compare neurons across trials and sessions was by analyzing the proportion of all neurons fired at different time intervals within specific trials. Pictured below are plots made analyzing the change in proportion of neurons fired over time, sampling the first 5 trials from session 1 (Cori) and session 18 (Lederberg).

These graphs looked similar across a wider variety of trials and sessions (including sessions with the other two mice). This method of analyzing neural activity did not result in meaningful insights, which led to a change in analysis. The average number of spikes per neuron per trial was calculated and plotted, with no meaningful relationship found. Eventually, the average number of spikes for every active neuron per trial was calculated, giving 1 value to represent ‘average neural activity’ per trial, for every trial in the 18 sessions. In graphing the average number of total spikes for all active neurons per trial (for a sample of the first 100 trials for all 18 sessions), it was possible to begin understanding differences in neural activities across sessions and across trials (and subsequently, as time within a session went on). A sample of the first 100 trials per session was used to account for the variety of trials per session (as it would not be useful to compare change over number of trials for a session with four times as many trials as another). These first 100 trials can be interpreted in the context of the relationship between trial and response variable from the start of and well into each session. While this gives a good impression for change across trial, it is not necessarily representative of the change across the entire session. Nevertheless, this information can be used to understand potential relationships between different predictors and factors.

As seen by the plot pictured above, there is great variety across sessions when modeling the relationship between the average number of all active neuron spikes per trial (for the first 100 trials of all 18 sessions). Splines were used to track the rate of change of average neuron spikes, as simple linear relationships seemed ineffective in showing the true variability of the data. Session 6, for example, has a very low average number of spikes, and this average stayed relatively consistent as trials progressed. On the other hand, session 13 had a far higher average number of spikes when compared to the other trials. In addition to this, session 13 also remained relatively consistent across trials. Other sessions had noticeable change in the average number of spikes as trials progressed. Session 9, for example, has a negative relationship between trials and average spikes, whereas session 7 features a clear positive trend. Some of these example plots are pictured exclusively below.

After analyzing average neurons fired for first hundred trials when sorting by session, a next logical step became investigating neural activity by mouse. This involved a similar setup to the visualizations by session, but switching the color and spline for mouse instead of trial.

By investigating the relationship by mouse, it became clear that there were noticeable differences in average spikes. Interestingly, despite fitting splines to the scatterplots, the relationship between trial number and average neuron spikes was linear for each mouse (for the sample of the first 100 trials). This supports the notion of a linear relationship between trial number and average number of neuron spikes per mouse. That being said, the slope of the approximate lines pictured above are all relatively low (with the exception of Cori, comparatively), indicating that a change in trial may result in little change in average neuron spikes. This runs in opposition to the preliminary hypothesis that the number of trials a mouse partakes in may have an affect on the neural activity of that mouse. It is also interesting to note the close performance of Hench and Lederberg, and the average lower neural activity of Forssmann when comparing to the other three mice.

With the relationship between average neuron spikes per trial (which is being used to generalize ‘neural activity’), it is possible to introduce the main variable of interest for prediction - feedback type.

As shown above, as trial number progresses (for the first 100 trials for all 18 sessions), the average number of neuron spikes for a success and failure trial remain in fairly close proximity. As time goes on, a small relationship can be observed, where the average number of neuron spikes for successful trials trends slightly negatively and the average number of neuron spikes for unsuccessful trials trends slightly positively. Generally, it seems that trials that resulted in success (regardless of session) tend to have slightly higher average neuron spikes across all neurons. This could indicate some sort of meaningful relationship between average neuron spikes and trial number.

To further investigate this relationship, this plot can be rearranged and split to show success/failure in relation to trial and average neuron spikes for each session.

The above plots show that, for most sessions, successful trials follow the general relationships between trial and average neuron spikes. For trials that result in failure, however, there is much higher variability. Though some of this may be due to a lower number of unsuccessful trials, the more extreme variability is worth noting. This plot can also be recreated grouping by mouse, as shown below.

The two mice performing at the two ‘extremes’ of the data set (Cori with a higher average number of neuron spikes and Forssmann with a lower level) both remained relatively consistent with the average number of spikes per trial regardless if the trial was a success or failure. Hench and Lederberg, however (who had a similar number of average spikes as discussed earlier), had much larger variety in trials that resulted in failure, with the average number of spikes potentially being lower. This could represent a potential relationship between which mouse was the subject and the chance for a successful trial, making the mice a potentially relevant predictor.

To conclude the analysis of the relationship between neural activity (average neuron spikes) and other predictors, contrast (including left and right contrast) was factored in.

By looking at the average neuron spikes per trial, divided by the left and right contrast (and the strength of stimuli applied), it can be concluded that a lower strength of stimuli correlates with lower neural activity, especially for the right contrast. This relationship is less clear for left contrast, which might imply that the mice were not processing the left contrast as well as the right. For both left contrast and right contrast, it seems there is more clustering of lower average neuron spikes for 0 contrast trials, whereas the average spikes for a contrast of 1 is more evenly distributed throughout the graph.

As pictured by the above graphs, there is a clear difference between the average number of neuron spikes and contrast for right contrast successful trials (where higher contrast correlates with higher average spikes), but the number of average neuron spikes for right contrast failure trials was far more variable. For left contrast, there was a less clear relationship, for both success and failure trials. This shows that there may be a relationship between success and failure and contrast level, especially for right contrast.

After analyzing the relationship between the response variable and all possible predictors, it becomes prudent to expand from the first 100 trials to all available data, with the goal of calculating summary statistics that pertain to all trials and all sessions.

Within the available data from 18 sessions, the session-wide proportion of successful trials is 0.7100964.

Looking at the success rate of each session (averaging the proportion of successes of all trials), it is evident all seshions had relatively high success rates. That being said, only a few were higher than the average success rate of 0.71.

Looking at success rate for each mouse, it is clear each mouse performed relatively well. Lederberg is the only mouse to beat the average success rate, and Cori lags behind the group slightly. By referencing earlier visualizations, an interesting relationship begins to form. Cori has the highest level of average neuron spikes compared to the other mice across the first 100 trials, yet has the lowest success rate. This could point towards an inverse relationship, where as average neuron spikes go up, success rate goes down.

Other graphs were made, but no other significant findings could be taken from the other visualizations made. Through exploratory data analysis, it was determined that trial number, session number, left contrast, right contrast, mouse name, and average neural spikes could be potential predictors for feedback type.

Data Integration

Most data integration was done during exploratory data analysis, with the goal in mind of being able to use the most amount of data possible, to increase the accuracy of potential models. To do this, data for each session was reorganized into a data frame with rows equal to the number of trials and columns for session number, trial number, mouse name, left contrast, right contrast, feedback type, and average neuron spikes per trial. Each row was labelled with the session it was a part of, the trial number, the mouse for that particular session, the left and right contrast, the average neuron spikes for that trial, and whether the trial was a success or failure. This process was repeated for each session, and these sessions were eventually combined into one large data frame, with 5,081 rows corresponding to every trial across the 18 sessions. This allows for searching for session through filtering, but provides a clean way to access the most amount of data as possible. This also sets the data up well for preprocessing, by making it easy to mutate variables into new ones.

One example of this was in calculating a new variable to add to the predictive abilities of future models. This variable was contrast difference, which represents the absolute value difference between left and right contrast. If, for example, both left and right contrast had values of 1, then the absolute value difference would be 0. This variable was created and included as a potential predictor since it adds context toe the left and right contrast variables. It provides a tangible relationship between the two types of contrast, which could be helpful for understanding how they interact in affecting the ultimate outcome of a specific trial.

Finally, the data was integrated into this new structure to allow for convenient conversion of categorical variables into useful variables that can be fed into a prediction variable. Specifically, this allowed for explicit one-hot encoding of the miceName variable. This was done outright, but is also calculated implicitly by the modeling functions used later in the project.

Predictive Modeling

Using the information gained from exploratory data analysis and the fully integrated data set, model building consisted of trying various combinations of predictors using a logistic regression modeling method. Logistic regression was originally chosen due to its strength in binary classification, which was the main topic of this project. To train and test each model, the integrated data set was randomly sampled and split into two data frames, 80% of the original integrated data going into a training model and 20% going into a testing model. K-folds cross validation was attempted for some models, but produced inaccurate results, leading to a simple split approach being chosen for model evaluation and validation. In total, 8 models were built, with various predictive capabilities. Classification threshold was chosen based on model ROC curve (specifically, choosing the approximate y value, or true positive rate, for the section of the ROC curve closest to a 90 degree angle). Cohen’s kappa coefficient was also considered, as a classification threshold that increased accuracy but significantly lowered kappa value was not desired, under the assumption that a lower kappa would mean lower inter-rater reliability, and worse performance at distinguishing between positive and negative feedback. Regardless of threshold tuning, AUC was used as the main decision factor to pick a final logistic regression model.

A brief summary of the logistic regression models made is as follows. logModel1 uses leftContrast, rightContrast, and average neuron spikes as predictors. modelOHE1 uses leftContrast, rightContrast, average neuron spikes, and one-hot encoding variables for each mouse (Cori, Forssmann, Hench, and Lederberg) as predictors. modelOHE2 uses the same predictors, but dropped Lederberg as a predictor to avoid the singularity created by the one-hot encoded variables. modelOHE3 uses the glm() functions internal encoding, and swaps the custom one-hot variables with just miceName. It otherwise has the same predictors. modelCont1 features the inclusion of ‘contrast difference’ (as discussed earlier), using left contrast, right contrast, average neuron spikes, and contrast difference as predictors. modelSesh1 uses leftContrast, rightContrast, average neuron spikes, and session number as predictors (based on the exploratory analysis done concerning session number previously). modelCont2 drops many variables, using average neuron spikes, contrast difference, and session number as predictors. Finally, modelEvery1 uses every variable as a predictor, including average neuron spikes, contrast difference, session number, trial number, leftContrast, rightContrast, and mice name. To attempt a rigorous analysis of the model containing every predictor, backwards selection was briefly tried, which did not result in less predictors. Each model was trained on the randomly chosen 80% data subset, and tested on the 20% data subset. Confusion matrices were constructed for each model to compare their overall accuracy, and their ability to recognize both successful trials and unsuccessful trials. For the sake of brevity these confusion matrices (except for the final model, below) have been omitted.

The ROC curves for the logistic models created (pictured below), and the associated AUC scores were used to pick a final model.

auc1 0.59
aucOHE 0.576
aucOHE2 0.579
aucOHE3 0.593
aucCont1 0.607
aucSesh1 0.61
aucCont2 0.635
aucEvery1 0.496
Based on the predefined selection criteria of highest AUC (area under ROC curve), modelCont2 was selected as the best model. This logistic regression model uses average neuron spikes, contrast difference, and session number to predict feedback type. Testing this model, the following confusion matrix was constructed.
ActualVal
-1 1
PredictedVal
  -1 64 107
  1 231 615

Using this test data, the model had a misclassification rate of approximately 33%, with overall accuracy of around 66%. The kappa value was 0.0785, indicating slight agreement (better than random chance).

For comprehensiveness, modeling methods outside of logistic regression were also considered. Namely, XGBoost was chosen as an alternative modeling method. XGBoost was chosen as an alternative model as it is a scalable gradient-boosted decision tree, under the hypothesis that it may provide more accurate results when compared to the final logistic regression model chosen. Two XGBoost models were made, one including all predictors except the one-hot encoding variable for the mouse Hench (to avoid a rank deficient matrix). The second XGBoost model included all predictors, but excluded mice name as a factor entirely. These two models were both trained with a random 80% subset of the data, and tested on the remaining 20% from the original integrated data. Each model featured a maximum depth of 15, using 2 threads and 30 rounds for lowering root mean square deviation (RMSE). Confusion matrices were made for the test classification for each model, and the ROC curves were plotted (alongside the final logistic model modelCont2) to select an appropriate classification threshold, and to compare AUC with the logistic regression model. These plots are pictured below.

aucCont2 0.635
aucXG 0.711
aucXG1 0.694
Using AUC, modelXG was chosen as the final model (being chosen over the final logistic regression model as well). This model features all predictors except the one-hot encoded variable for the mouse Hench. Using the 20% subset of the original integrated data as test data, the following confusion matrix and statistics were calculated.
ActualVal
-1 1
PredictedVal
  -1 136 178
  1 124 579

Using this test data, the model had a misclassification rate of 29.695%, with an overall accuracy of 70.3%. The kappa value was 0.2696, indicating a fair amount of agreement.

This final model maximizes the accuracy of prediction, while managing the models ability to predict both successes and failures well. It has an AUC of 0.711, implying the model is able to separate between classes fairly well.

Prediction Performance Based on Test Sets

As a final test of the predictive capability of the final chosen model, modelXG (which is an XGBoost model), was tested with predetermined test data, given as a measure to test the processing and integration capability of the project and to evaluate the model once more. The test data was given in the format of two additional RDS files, containing two test sets of 100 randomly chosen trials from sessions 1 and 18. This data was processed and integrated into a data structure similar to the one used for model building done throughout the project (albeit with smaller scale). There were a few notable challenges in integrating this test data. For one, since the data only reflected trials from sessions 1 and 18, only the mice Cori and Lederberg were represented. Since the model relied on data from the other mice as well, one hot encoded variables for the other mice were made as well (which contained all zero values). Average neuron spikes and contrast difference was calculated in a similar method as done previously, and the resultant data frame matched the ordering and structure of the data frame used for all model building, allowing for seemless testing with this new data.

ActualVal
-1 1
PredictedVal
  -1 30 26
  1 25 119

Using this final test set, the final chosen modelXG has a misclassification rate of 25.5%, with an overall accuracy of 74.5%. It has a kappa vlaue of 0.3641, suggesting a fair amount of agreement. The confusion matrix, pictured above, shows the difference between true positive, true negative, false positive, and false negative.

Discussion

Overall, the final model chosen features a fair amount of accuracy and reliability in predicting the outcome of a trial (namely, whether it was a success or a failure). In doing so, the ultimate goal of the course project was met, as a model was found that can perform well in a predictive setting. This model may not be the optimal model for prediction, but performs well under the conditions set through exploratory data analysis and through the chosen modeling/validation techniques used. This modeling process allowed for a deeper, more fundamental understanding of the data, and allowed for practice with various statistical data science techniques.

There were several decisions made throughout the project that affect the final models predictive capability when ran against the test set. For one, as discussed previously, brain_area as a variable was omitted to increase the interpretability and decrease the complexity of neural activity as a measurement for feedback type. Only certain modeling techniques were considered (logistic regression and XGBoost), which may have prevented the theoretical optimal model from being chosen. These techniques were chosen due to familiarity and the ability for simple comparison through ROC and AUC. Additionally, using ROC and AUC as the ultimate decision criteria for model selection may have decreased the maximum accuracy for the model, in favor of a more valid and reliable one.

Some techniques were attempted to improve model performance with varying degrees of success. K-folds cross validation was tried as a substitute for simple data splitting, but with marginal success. This could be due to unfamiliarity with the intricacies of the caret package, and because of the format of the data being fed into CV functions. Additionally, when building the logistic regression models, backwards elimination variable selection was attempted, but did not result in different results. This could be because of some human error, or perhaps the full model did represent the best collection of predictors.

Ultimately, this project allowed for deeper understanding of the relationship between stimuli conditions, neural activity (and a few other factors), and the performance of a mouse in correctly turning a wheel. It allowed for a more rigid understanding of the underlying statistical methods used to analyze this relationship, and ultimately predict the outcome given the chosen factors.

Acknowledgements and References

https://www.geeksforgeeks.org/k-fold-cross-validation-in-r-programming/

https://xgboost.readthedocs.io/en/stable/R-package/xgboostPresentation.html

https://topepo.github.io/caret/

https://topepo.github.io/caret/train-models-by-tag.html#Logistic_Regression

https://rpubs.com/jkylearmstrong/logit_w_caret

https://topepo.github.io/caret/pre-processing.html

https://www.programmingr.com/error-in-glm-fitx-c1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-na-nan-inf-in-039y039/

https://www.statology.org/prediction-from-rank-deficient-fit-may-be-misleading/

https://medium.com/@shaileydash/understanding-the-roc-and-auc-intuitively-31ca96445c02

https://www.yourdatateacher.com/2021/06/14/are-you-still-using-0-5-as-a-threshold/

https://www.hackerearth.com/practice/machine-learning/machine-learning-algorithms/beginners-tutorial-on-xgboost-parameter-tuning-r/tutorial/

https://www.iguazio.com/glossary/classification-threshold/#:~:text=The%20most%20common%20method%20for,TN)%20at%20all%20classification%20thresholds.

https://stats.stackexchange.com/questions/123124/how-to-determine-the-optimal-threshold-for-a-classifier-and-generate-roc-curve

https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5

https://cran.r-project.org/web/packages/olsrr/vignettes/variable_selection.html

https://topepo.github.io/caret/measuring-performance.html

https://machinelearningmastery.com/one-hot-encoding-for-categorical-data/

https://www.linkedin.com/advice/0/how-can-you-handle-categorical-variables-logistic-regression-jdtzc#:~:text=While%20working%20with%20logistic%20regression,should%20use%20one%20hot%20encoding.

https://ggplot2-book.org/scales-colour

https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-019-0666-3

https://machinelearningmastery.com/k-fold-cross-validation/

https://www.digitalocean.com/community/tutorials/plot-roc-curve-r-programming

ChatGPT conversation log: https://chat.openai.com/share/6b1925e1-90a5-4ee9-acf8-0dd7ef5a9ba5 (Includes brief conversation on another subject at the beginning, as a new conversation was not started accidentally)

Course notes/lecture slides/TA discussions and demos were also referenced in the execution of this project.